15 research outputs found

    Polly's Polyhedral Scheduling in the Presence of Reductions

    Full text link
    The polyhedral model provides a powerful mathematical abstraction to enable effective optimization of loop nests with respect to a given optimization goal, e.g., exploiting parallelism. Unexploited reduction properties are a frequent reason for polyhedral optimizers to assume parallelism-prohibiting dependences. To our knowledge, no polyhedral loop optimizer available in any production compiler provides support for reductions. In this paper, we show that leveraging the parallelism of reductions can lead to a significant performance increase. We give a precise, dependence-based definition of reductions and discuss ways to extend polyhedral optimization to exploit the associativity and commutativity of reduction computations. We have implemented a reduction-enabled scheduling approach in the Polly polyhedral optimizer and evaluate it on the standard Polybench 3.2 benchmark suite. We were able to detect and model all 52 arithmetic reductions and achieve speedups up to 2.21× on a quad-core machine by exploiting the multidimensional reduction in the BiCG benchmark. Comment: Presented at the IMPACT15 workshop.
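
    As an illustration (a hand-written sketch, not the paper's transformation, which Polly performs automatically on LLVM IR), the BiCG reduction the abstract mentions looks as follows; the accumulation into s[j] crosses iterations of the outer loop and is only parallelizable once recognized as an associative, commutative reduction:

        /* Illustrative sketch, not code from the paper: the BiCG kernel's
         * multidimensional reduction. The update of s[j] accumulates over
         * the outer i loop, so a reduction-unaware dependence analysis must
         * serialize it; treating it as an associative, commutative reduction
         * legalizes a parallel schedule, written here by hand with OpenMP. */
        void bicg(int n, double A[n][n], double r[n], double p[n],
                  double s[n], double q[n]) {
            for (int j = 0; j < n; j++)
                s[j] = 0.0;
            #pragma omp parallel for
            for (int i = 0; i < n; i++) {
                q[i] = 0.0;
                for (int j = 0; j < n; j++) {
                    #pragma omp atomic
                    s[j] += r[i] * A[i][j];  /* reduction across i iterations */
                    q[i] += A[i][j] * p[j];  /* reduction private to row i */
                }
            }
        }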

    Architecture-parametric timing analysis

    No full text
    Platforms are families of microarchitectures that implement the same instruction set architecture but that differ in architectural parameters, such as frequency, memory latencies, or memory sizes. The choice of these parameters influences execution time, implementation cost, and energy consumption. In this paper, we introduce the first general framework for architecture-parametric timing analysis (APTA). APTA computes an expression that bounds the worst-case execution time (WCET) of a program in terms of architectural parameters. This makes it possible to configure a platform, at design or even at run time, in a way that is guaranteed to meet all deadlines, while minimizing implementation cost and/or energy consumption. We demonstrate the feasibility of our approach by implementing APTA for a precision-timed (PRET) platform and by evaluating our implementation on the Mälardalen benchmarks.
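
    To make the idea of a parametric bound concrete, here is a hypothetical sketch (the parameter names and the shape of the expression are ours, not the paper's) of how a WCET expression produced once by the analysis could be re-evaluated cheaply for each candidate platform configuration:

        /* Hypothetical sketch: a parametric WCET bound of the form
         *   WCET(p) = compute_cycles * cycle_time + mem_accesses * mem_latency.
         * The coefficients would be produced once by the analysis; the bound
         * can then be re-evaluated per configuration at design or run time. */
        #include <stdio.h>

        typedef struct {
            double cycle_time_ns;   /* 1 / frequency */
            double mem_latency_ns;  /* main-memory access latency */
        } platform_params;

        typedef struct {
            double compute_cycles;  /* worst-case non-memory cycles */
            double mem_accesses;    /* worst-case memory accesses */
        } wcet_expr;

        double wcet_bound_ns(const wcet_expr *e, const platform_params *p) {
            return e->compute_cycles * p->cycle_time_ns
                 + e->mem_accesses   * p->mem_latency_ns;
        }

        int main(void) {
            wcet_expr task = { .compute_cycles = 120000, .mem_accesses = 3400 };
            platform_params slow = { .cycle_time_ns = 2.0, .mem_latency_ns = 80.0 };
            platform_params fast = { .cycle_time_ns = 1.0, .mem_latency_ns = 80.0 };
            printf("bound slow: %.0f ns\n", wcet_bound_ns(&task, &slow));
            printf("bound fast: %.0f ns\n", wcet_bound_ns(&task, &fast));
            return 0;
        }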

    Input space splitting for OpenCL

    No full text

    GPU First -- Execution of Legacy CPU Codes on GPUs

    Full text link
    Utilizing GPUs is critical for high performance on heterogeneous systems. However, leveraging the full potential of GPUs for accelerating legacy CPU applications can be a challenging task for developers. The porting process requires identifying code regions amenable to acceleration, managing distinct memories, synchronizing host and device execution, and handling library functions that may not be directly executable on the device. This complexity makes it challenging for non-experts to leverage GPUs effectively, or even to start offloading parts of a large legacy application. In this paper, we propose a novel compilation scheme called "GPU First" that automatically compiles legacy CPU applications directly for GPUs without any modification of the application source. Library calls inside the application are either resolved through our partial libc GPU implementation or via automatically generated remote procedure calls to the host. Our approach simplifies the task of identifying code regions amenable to acceleration and enables rapid testing of code modifications on actual GPU hardware in order to guide porting efforts. Our evaluation on two HPC proxy applications with OpenMP CPU and GPU parallelism, four microbenchmarks with originally GPU-only parallelism, as well as three benchmarks from the SPEC OMP 2012 suite featuring hand-optimized OpenMP CPU parallelism showcases the simplicity of porting host applications to the GPU. For existing parallel loops, we often match the performance of corresponding manually offloaded kernels, with up to 14.36× speedup on the GPU, validating that our GPU First methodology can effectively guide porting efforts of large legacy applications.
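
    Conceptually (this sketch is ours and simplifies the paper's actual mechanism), GPU First can be pictured as wrapping the unmodified application entry point in an offloading region, so that existing OpenMP CPU parallelism maps onto GPU threads and library calls such as printf are served by the partial device libc or forwarded to the host:

        #include <stdio.h>

        /* Unmodified legacy CPU code: */
        int legacy_main(void) {
            double sum = 0.0;
            #pragma omp parallel for reduction(+ : sum)
            for (int i = 0; i < 1000000; i++)
                sum += 1.0 / (1.0 + i);
            printf("sum = %f\n", sum);  /* partial device libc or host RPC */
            return 0;
        }

        /* Compiler-generated wrapper, shown here as source for clarity: */
        int main(void) {
            int ret;
            #pragma omp target map(from : ret)
            ret = legacy_main();
            return ret;
        }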

    Runtime pointer disambiguation

    No full text
    In order to optimize code effectively, compilers must deal with memory dependences. However, the state-of-the-art heuristics available in the literature to track memory dependences are inherently imprecise and computationally expensive. Consequently, the most advanced code transformations that compilers have today are ineffective when applied to real-world programs. The goal of this paper is to solve this conundrum through the hybrid disambiguation of pointers. We provide a static analysis that generates dynamic tests to determine when two memory locations can overlap. We then produce two versions of a loop: one that is aliasing-free, hence easy to optimize, and another that is not. Our checks let us safely branch to the optimizable region. We have applied these ideas to polly-llvm, a loop optimizer built on top of the LLVM compilation infrastructure. Our experiments indicate that our method is precise, effective, and useful: we can disambiguate every pair of pointers in the loop-intensive Polybench benchmark suite. The result of this precision is code quality: the binaries that we generate are 10% faster than those that polly-llvm produces without our optimization, at the -O3 optimization level of LLVM. Given the current technology to statically solve alias analysis, we believe that our ideas are a necessary step to make modern compiler optimizations useful in practice.
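
    A minimal sketch of the hybrid scheme (illustrative; the paper's tool emits the equivalent at the IR level, and the check shown is one simple form such a generated test could take): the statically derived dynamic test checks whether the two accessed ranges can overlap and branches to an aliasing-free loop version when they cannot:

        /* Illustrative two-version loop with a runtime disambiguation check. */
        #include <stddef.h>
        #include <stdint.h>

        void saxpy(float *a, const float *b, float s, size_t n) {
            uintptr_t pa = (uintptr_t)a, pb = (uintptr_t)b;
            size_t bytes = n * sizeof(float);
            /* Dynamic test generated from the static analysis:
             * do the ranges [a, a+n) and [b, b+n) overlap? */
            if (pa + bytes <= pb || pb + bytes <= pa) {
                /* Aliasing-free version: the compiler may vectorize freely. */
                float *restrict ar = a;
                const float *restrict br = b;
                for (size_t i = 0; i < n; i++)
                    ar[i] += s * br[i];
            } else {
                /* Possible overlap: keep the conservative version. */
                for (size_t i = 0; i < n; i++)
                    a[i] += s * b[i];
            }
        }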